Efficient Semantic-Aware Detection of Near Duplicate Resources

Authors

Ekaterini Ioannou

Odysseas Papapetrou

Dimitrios Skoutas

Wolfgang Nejdl

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
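As a rough illustration of what combining a similarity-search index with RDF descriptions could look like, the sketch below builds MinHash signatures from the predicate-object pairs of each resource and estimates their Jaccard similarity. The abstract does not name the indexing scheme, so MinHash is an assumed stand-in here, not the authors' method. The function names, the `dc:`/`ex:` property names, and the sample articles are all hypothetical.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def rdf_tokens(description):
    """Turn (predicate, object) pairs into a set of string tokens.

    Assumption: the subject URI is ignored, because two near duplicate
    resources usually have different identifiers.
    """
    return {f"{predicate} {obj}" for predicate, obj in description}


def minhash_signature(tokens, num_hashes=64):
    """Keep the minimum hash value per seed; one value per hash function."""
    if not tokens:
        raise ValueError("cannot build a signature for an empty description")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical RDF descriptions of three news articles.
article_a = {
    ("dc:title", "Flood hits city"),
    ("dc:subject", "weather"),
    ("ex:mentions", "Berlin"),
    ("dc:date", "2010-01-05"),
}
article_b = article_a | {("dc:creator", "Agency B")}  # near duplicate of a
article_c = {
    ("dc:title", "Election results announced"),
    ("dc:subject", "politics"),
    ("ex:mentions", "Athens"),
}

sig_a = minhash_signature(rdf_tokens(article_a))
sig_b = minhash_signature(rdf_tokens(article_b))
sig_c = minhash_signature(rdf_tokens(article_c))

sim_ab = estimated_similarity(sig_a, sig_b)
sim_ac = estimated_similarity(sig_a, sig_c)
```

In a sketch like this, the number of hash functions trades accuracy for cost: more hash functions give a tighter similarity estimate but a longer signature. An application would choose it according to its own quality requirements.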

Similar resources

Consumer photo management and browsing facilitated by near-duplicate detection with feature filtering

Near-duplicate detection techniques are exploited to facilitate representative photo selection and region-of-interest (ROI) determination, which are important functionalities for efficient photo management and browsing. To make the near-duplicate detection module robust to noisy features, three filtering approaches, i.e., point-based, region-based, and probabilistic latent semantic (pLSA), are deve...

Using Page Size for Controlling Duplicate Query Results in Semantic Web

The Semantic Web is the web of the future. The Resource Description Framework (RDF) is a language for representing resources on the World Wide Web. When these resources are queried, the problem of duplicate query results occurs. Present techniques use hash index comparison to remove duplicate query results. The major drawback of using the hash index to remove duplicate query results is that, if there is ...

An Efficient Approach for Near-duplicate page detection in web crawling

The drastic development of the World Wide Web in the recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overhea...

Speed-up Multi-modal Near Duplicate Image Detection

Near-duplicate image detection is a necessary operation to refine image search results for efficient user exploration. The existence of large numbers of near duplicates requires fast and accurate automatic near-duplicate detection methods. We have designed a coarse-to-fine near duplicate detection framework to speed up the process and a multi-modal integration scheme for accurate detection. The...

Data cleansing base on subgraph comparison

With the rapid development of semantic web technology, the explosion of RDF data has become a challenging problem. Since RDF data always come from different resources, which may overlap with each other, they can contain duplicates. These duplicates may cause ambiguity and even errors in reasoning. However, little attention has been paid to this problem. In this paper, we study the problem and give a...

Publication date: 2010
